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Abstract 

In this paper we report the prototype of the first coalescing binary detection 
pipeline fully implemented on NVIDIA GPU hardware accelerators. The code 
has been embedded in a GPU library, called culnspiral and has been developed 
under CUDA framework. The library contains for example a PN gravitational 
wave signal generator, matched filtcring/FFT and detection algorithms that 
have been profiled and compared with the corresponding CPU code with dedi- 
cated benchmark in order to provide gain factor respect to the standard CPU 
implementation. In the paper we present performances and accuracy results 
about some of the main important elements of the pipeline, demonstrating the 
feasibility and the chance of obtain an impressive computing gain from these new 
many-core architectures in the perspective of the second and third generations 
of gravitational wave detectors. 
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Recently, innovative solutions have been implemented by CPU producers, 
introducing highly parallel design in processors architecture. The productive 
process is evolving from the " multi-core" era toward the " many-core" era, where 
hundreds of thousands of processing core are embedded on the same processor. 
Such kind of changes on architectural paradigm require an identical change on 
the programming paradigm. Modern, GPUs(Graphics Processing Unit) fall in 
the manycore definition, being characterized by hundreds of processing cores. 
Moreover due to the inherently parallel nature of 2D and 3D computer graphics 
problems, GPUs are well suited for more general problems, performing complex 
computation in very general fields. Since an impressive develop is taking place 
in the GPU technology, it makes sense to explore GPU performances not only 
for graphics, but also for other applications. The achievable speedup on GPU is 
strongly correlated to the algorithms implementation, which can be more than 
2 orders of magnitude under optimal conditions. Therefore we can claim that a 
so big improvements in performances has an effect also on how science is made, 
giving access to new solutions that were unaccessiblc or very difficult to achieve 
before. 

There are some important news also under the GPU devices programming 
simplification and unification. On that we have to remark CUDA architecture [7[ 
developed by NVIDIA and OpenCLQ, developed inside the Khronos Consor- 
tium. This last project defines a common language for parallel computing de- 
vices and it is supported by the major semiconductor and hardware vendors. 
The work shown in this paper has been developed on CUDA. 



In this paper we show the implementation of these technologies in the context of 
the gravitational wave astronomy. In particular we present the first prototype 
of a detection pipeline for coalescing binaries (CB) gravitational wave signal 
completely GPU code based. 

Actually the two most important ground-based gravitational-wave (GW) 
detectors had completed long science data taking, with a very iinpressive sen- 
sitivity close to their design performance [H 0. These are Virgojj, 0] in Italy 
and LIGOQ in United States. During the next years this first generation of 
detectors will undergo to several upgrades, moving toward enhanced and ad- 
vanced versions. The direct detection of the first GW signal is expected to be 
made by Advanced versions, where the CB events rate is enough to have an high 
degree of confidence. About third generation, in 2009, the Einstein Telescope[6j 
project stared. This is a design study about detectors of third generation and 
is supported by the European Commission under the Framework Program 7 
(FP7; Grant Agreement 211743). It concerns the study and the conceptual de- 
sign for a new research infrastructure that will bring Europe to the forefront 
of the most promising new development in our quest to understand the history 
and future of the Universe, the emergence of the field of Gravitational Wave 
Astronomy 13l [IJ] . The gain in terms of sensibility between first ground based 
gw detector and ET like detectors is expected to be 2 order of magnitude more, 
increasing exponentially the demanded data analysis computing power. 
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One of the most promising sources of gravitational wave signal for detection 
purpose are the coalescing binaries systems, which consist on two fast rotating 
compact objects, such as neutron stars and black holes, that loses orbital energy 
under gravitational wave emission Q. 

The algorithm used for the optimal detection procedure is based on matched 
filtering (MF) technique 15|, |l6[ , where the detector signal is " compared" via 
correlation with a set of expected theoretical signals, called templates. This 
technique is very computing demanding, because for a complete analysis an 
high number of templates (the MF filters) have to be considered and moreover 
because the core algorithm is the Fast Fourier Transform(FFT)[l^. 

The templates number is a critical issue, depending strongly on the detector 
sensitivity and on the analysis accuracy. Actually we are facing the detection 
procedure using roughly some thousands of templates. This amount is enough 
for the first generation detectors, but if wc consider ET generation detectors, 
we have estimated a number of 1-2 million of templates to perform the same 
detection strategy, increasing the complexity of some orders of magnitude. 

In this paper we report the first coalescing binary detection pipeline, com- 
posed by template bank generation, matched filtering procedure and maxim 
estimation based completely on GPU. In the next section we describe this items 
with more details. 

With this work we demonstrate the feasibility to use these architectures 
for gravitational wave physics purpose, obtaining a factor 50-100 on the actual 
performances and the applicability to cover computational problems related to 
the detectors of third generation as ET with the perspective of the manycore 
computing era. 



1. Graphics Processing Processor (GPU) and CUDA 

The prototype pipeline presented in this work has been developed using 
CUDA, the computing engine of the NVIDIA Processors. The code has been 
written programming with " C for CUDA" , the specific environment to compile 
code for NVIDIA GPU. In this framework the programmer can define special 
C functions, called kernels, that, when called, arc executed N times in parallel 
by N different CUDA threads into GPU core processors. In some sense this 
architecture is an extension of SIMD model. In detail, when a kernel is executed, 
N copies of it are generated and executed in parallel inside GPU device; moreover 
at each thread a unique id is assigned, starting from to TV — 1, which can be 
used as unique identifier inside the kernel code for several purpose, such as 
accessing memory. 

In the CUDA programming model, CUDA threads execute on a physically 
separate device, operating in some sense as a coprocessor, respect to host com- 
puter that runs the C program. Sometime this programming model is reported 



^At such scale the brute force is not the best solution, and the procedure can be made 
more skilled, permitting to speedup the analysis. 
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as STMT (Single Data Multiple Threads) as likeliness to SIMD, where instead of 
single data multiple data, a single kernel execution produces a multiple pool of 
threads running. 

CUDA has some advantages respect to other general purpose computation 
on CPUs (GPGPU)[2|: 

• Scattered reads code can read from arbitrary addresses in memory. 

• Shared memory - this can be shared among threads and can be used as a 
user-managed cache, and permitting higher bandwidth. 

• LI and L2 cache on Fermi architecture 

• Faster downloads and read backs to and from the GPU 

• Full support for integer and bitwise operations, including integer texture 
lookups. 

• Very good support and SDK 

The code presented in this paper has been tested on NVIDIA GTX275 board 
(GT200 series), characterized by 240 computing core, GDDR3 160 GB/s Mem- 
ory, ITFlops peak performance and a consumption of 200W. 

2. Coalescing binary signal detection problem 

In this section a brief introduction to the gravitational wave signal charac- 
teristics and detection procedures are reported. 

Gravitational waves have been predicted by A. Einstein in 1916 as general 
relativity theory implication[loj . Under particular circumstances, accelerated 
masses could produce a perturbation in the space-time that propagates as a 
wave. Gravitational radiation has not yet been directly detected, but there 
are many indirect proofs of its existence, it was remarkable the Nobel Prize 
in Physics, awarded for measurements of the Hulse- Taylor binary system in 
19930. 

Sources of gravitational waves include binary star systems composed by com- 
pact objects like Neutron Stars or Black Holes. In these systems orbital energy 
is carried out via gravitational wave, producing a progressive reduction of the 
relative distance between the two stars. This process continues since the coales- 
cence event, where the two starts stop to live as distinct objects, beginning to 
merge. Coalescing compact binaries with neutron star or black hole components 
provide the most promising sources of gravitational radiation for detection by 
the interferometric detectors experiments. 

Gravitational waves (GW) are transverse waves, which effect on metric dis- 
tortion perpendicularly to the propagation direction. A GW can be expressed 
by two polarizations: h(t) = Q!h-|.(t) -I- /3hx(t). The prediction about the sig- 
nal shape can be made theoretically, but, due to the complexity of the general 
relativity equations, it can be made only with approximation methods. The 
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expression describing the signal is function of a set of parameters, which de- 
pend on source physical characteristics [ll , 12 1. A typical set of parameters is 
{1711,1712, R,i, (p), where: 



• ml;m2 - star masses 

• R - relative distance between the binary system and the observer 

• i - inclination angle 

• (j) - initial phase 

• to - arrival time 

2.1. Matched filtering 

On defining the detection strategy a big lack is about signal knowledge, 
because we don't know which and when the binary system will emit a GW. 

Under the conditions of two polarization signal with a set of unknown pa- 
rameters, the detection theory of signal provides a procedure based on Neyman- 
Pearson method, permitting the definition of the optimal detector. In this case 
the "detection rule" is based on matched filtering techniaue[l6|. that is the op- 
timal linear filter that maximizing the signal-to-noise-ratio (SNR) in presence 
of additive stochastic noise. 

A matched filtering between two signals is obtained correlating a template 
of the signal that we are looking for, used as reference signal, with the stream 
of data where is assumed to be present the signal itself. This operation is 
equivalent to convolve the input data stream with a conjugated time-reversed 
version of the template (cross-correlation). 

In term of computational complexity, we use to apply matched filtering in 
frequency domain, where it is possible to benefit of the FFT algorithm [l8j, 
having a complexity of 0(iV log(7V)), where N is the length of the two vectors 
of data and transforming the time correlation in a complex vector product. 

Technically, given the input stream of data x{t), the unknown signal 6{t) 
that we are looking for and the power spectrum of the noise S(f), we can define 
the matched filter in presence of colored noise as: 

.This formula can be use to define the matched filter as an inner product. 

In case of Gaussian-white noise, the previous expression is reduced to the 
standard correlation function [l^ ■ For that the 12.11 is sometime referred as 
correlator. 

Due to what has been previously reported, the set of parameters {ml, m2, R, i, </>) 
describing the system is completely unknown. Under these hypothesis the de- 
tection procedure is completely blind and require to use a set of templates, 
which are built a priori. Each template is associated with a point in the masses 
parameters space (ml,m2).i, describing the signal sources, and the parameter 
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space is defined in order to cover the region where signal is expected to be. The 
template placement is sampled in such a way to have a tolerable mismatch error 
between close templatetH. The number and density of points, composing the 
grid, has to be enough and correctly distributed in order to reduce the possibil- 
ity of a signal detection miss, due to the mismatch between the real signal and 
the templates bank filters. 

The procedure!!^ [21I ? ] needs to compare each grid point, corresponding 
to a specific template i/(ml, TO2)i, with the data stream using the matched 
filtering and than to look for the values above a certain predefined threshold, 
which depends on detection probability and the a priori false alarm decision[l6j. 

We can synthesize the pipeline steps for each input data chunk, as follow: 
- Loop over template bank, where for each parameters couple (ml^, m2i): 

1. generate and normalize the template 6{t, mli, m2i) in order to be < 6{t), 6{t) > = 
1 

2. calculate the two correlators with matched filter for the two polarizations: 

cXit) x{t),0+{t) >,cx(i) =< x{t),0,,{t) > 

3. add two polarization in quadrature, c*(t) ~ ^ c!|_(t)^ + (t)^ 

4. look for {ig} = max{d{t) > A), where A is a predefined threshold. The 
set is the ensemble of the candidate arrival time. The amplitudes c(iQ) 
give the signal-to-noise ratio, proportional to the detection probability. 

- Reduce the [c*(<o)] set, in order to detect local maximum 



3. Pipeline Implementation 

The prototype pipeline for coalescing binary signal, reported in this paper, 
is based on the procedure described in the previous section. In detail the code 
loads the template bank data from an ASCII file and than it follows the detec- 
tion procedure as reported in figure [T] In order to obtain the maximum benefit 
on using GPU, it fundamental to make all possible computation on GPU board, 
reducing or hiding as much as possible data exchange between GPU memory 
and PC main memory and increasing as much as possible the GPU computa- 
tional density. By these consideration we can conclude that the approach where 
the GPU is used just as a simple coprocessor it is not the best one (for example 



wrapping just some kernelfunctions, such as FFT[17[), because the host /device 
10 can dramatically reduce performance respect to what these new technologies 
can express. For that, with the purpose of exploring such new multicore archi- 
tecture, we have developed culnspiial, a prototype library, containing functions 
for the CB detection pipeline fully working on GPU device. 



■^Usually a 2-5% of mismatch error is used 

•^This means to make, if it is possible, data lO during GPU computation. This permits to 
hide data transfer weight to the global computation time. 
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Figure 1: In this plto we report the rough design of culnspiral pipeline and the user level 
functions involved in the computation On the left side is represented the host computing side 
and on the right the GPU computing side. The pipeline in mainly characterized by template 
generation, normalization, FFT, matched filtering product in frequency domain, iFFT, events 
detection and reduction 



In our work the pipeline has been developed defining the GPU library with 
a set of CUDA kernels and host functions which execute the CB pipeline algo- 
rithms all inside GPU device, except for some "reduction" algorithms that need 
CPU contribute. In this section we report about this CB signal detection code 
based fully on GPU, showing the achieved speed up respect to similar CPU 
implement at ion . 

3.1. Pipeline main details 

The core of the analysis is made in frequency domain, in order to get vantage 
from the DPT algorithm. In detail we use CUFFT library, the FFT implemen- 
tation for nvidia GPUji^]. In figure [T] we report a schema of the CB pipeline, 
dividing the data/work flow between host part (left side) and GPU part (right 
side) . In the same picture we report also the name of the library functions in- 
volved at each step. In order to estimate GPU/CPU speedup, we have created 
a CPU counterpart code for some of the GPU routines. This choice permits 
us to perform performances and accuracy analysis, using the CPU results as 
reference. 
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Here we report a brief description about some of the pipeline functions, as 
shown in[T] 

host_template_init this function acts on host side and it initiahzes the gener- 
ator constants, starting from star masses (ml,m2) of the template bank, 
passed as arguments. 

device_template_init initialized constants are passed to the GPU memory. 

device_template_proc this command calls the generator kernel, filling the 
GPU memory with the template data (both polarizations) . At this stage 
the generation is completed. This function accepts some optional 
FLAGS used to control, for example, template normalization, template 
domain generation and host memory synchronization. The normalization 
code is based on a distributed reduction algorithmfi^l that involve GPU 
shared memory and CPU in order to speed up computation. 

device_complex_mf this function applies matched filtering formula in fre- 
quency domain between each polarization and the input data chuck 5, 
using the 12.11 The two outputs of this step arc called single polarization 
" correlator" . 

device_cufft_C2R this function applies inverse DFT on each previous polar- 
ization correlator output, in order to go back in time domain. 

device_madd2 this function combines together the two polarization correla- 
tors output using the l2.1l The output is a time domain vector c[t\, called 
" correlator output" . 

device_maxfind this function is used to look for "correlator output" values 
above threshold, than could be associated with true detection. This func- 
tion is implemented using a distributed reduction algorithm that involves 
GPU, GPU shared memory and CPU in order to speed up this process. 
The clusterization phase is made only by CPU. This chose permits us to 
hide this computing part behind the next iteration. 

3.2. FFT and memory access 

The core of the matched filtering technique is the DFT algorithm, permit- 
ting to speed up the correlations computation. This algorithm has a numerical 
complexity of 0{Nlog{N)) respect to the time correlation, which has complexity 
of 0{N'^). NVIDIA provides a specific library to compute FFT on GPU, called 
CUFFT. In the context of Data Analysis, FFT is one of the most common 
and important algorithm, used for example for spectral analysis, correlation 
and convolution and other common tasks. For that reason, we developed a 
specific test to evaluate its performances, comparing CUFFT results with the 
same CPU implementation of the benchmark, using the famous fftw library [l^ . 
At the moment, the main limitation of cufft is the support for single precision 
only, that could produce some numerical problems when the signal dynamics is 
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getting higher. In order to reproduce the same conditions on both GPU and 
CPU architectures, the fftw hbrary has been compiled in single precision, single 
thread with sse supporl[f|. 

The benchmark has been designed to record FFT execution time on different 
size vectors, as show in figure^] In order to get statistic, the code loops the FFT 
function over the same vector length^. In figured] we report the comparison be- 

GPU/CPU FFT performances 

processing time and GFIops 




Figure 2: We report the performances comparison between FFTW and CUFFT library. We 
build a specific test that measure the execution time and GFIops of the two implementation 
while vector length change. In the CB analysis usually we work arount vector of length 2^®, 
where CUFFT gets a x50 performance gain. 

tween cudaFFT(GPU-triangle) and fftw(CPU-circle) performances, measuring 
the average processing time (dash line) and the GFIops (line). As defined in 
fftw documentation [l9[ in realtocomplex case, GFIops are estimated as: 

5Nlog2iN)At2 

where "At" is the average FFT computing time expressed in microseconds. For 
the cufft library case we use 5Nlog2{N)/ At formula, because at the moment 
cudafft doesn't support an optimized routine for real to complex transformation. 
In fact the library function hides a complex to complex transformation. 

Obtained results show how, for a given vector length, the fft on GPU is many 
tens time faster than the fftw implementation. In CB signal detection analysis, 



■^Usually, the fftw single precision gains the performance of 30% respect to double precision 
^The number of loops depends on the vector size. For example, for a vector of length 

4194304, the number of loops per vector is 2000 and that number change inversely proportional 

with the vector length. 
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vectors with a length greater than 2^^ samples are usually considered interesting. 
In this case the performance gain is about 55 and the peak performances is 
obtained for vector length of 2^*^ where the GPU provides more than 45 GFlops 
of computing power. 

A remark, these high performances are achievable only if vectors data reside 
already on GPU device memory, otherwise data need to be first moved from 
host memory to device memory, paying a not negligible cost in terms of 10 
and computation delay. By that we evaluated with a benchmark the Host 
memory - Device memory communication overhead, that technically depends 
mainly on PCI-E bus speed. For example, in our signal detection case, we 
can encounter 10 operation during the startup/init activity, reduction function 
and correlators maximum handling. We explored this aspect, estimating the 
10 transfer time changing the vector data length exchanged between the two 
memories. These results are reported in figure |3] where we show the number 



GPU Performance test 

FFT single precision - memcpy - GTX 275 
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Figure 3: In this plot we report the number of operations per secconds. The considered 
operations are: host-to-device memory copy, FFT alone and the combination of the two 
operations, simulating the overhead due to lO. These results show how the GPU has to be 
treated as a real stand-alone system and not as a co-processor, working as match as possible 
on GPU side and avoiding un-hidden lO 

of copies per second(squarc symbol) respect to different vector data lengths. 
Moreover we show the equivalent number of FFT per second (triangle symbol) 
applied on the same data, without memory copy (these are the same results used 
for [2]) and the combination of Memory Copy and FFT algorithm (star symbol). 
These results provide the FFT performances that can be achieved when FFT 
is applied on data loaded each time from host memory to Device memory. As 
effect the memory copy step produces a performance loss of 40% respect to a 
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pure FFT computation. By these results we can make a consideration about 
GPU programming style, learning how using GPU just as a co-processor, for 
example replacing fftw library functions with CUFFT library function , it is 
not in principle the best design. In order to get top performances from GPU 
architecture it is important to organize the code having an high computing 
GPU density and writing GPU oriented code. The work reported on this paper 
follows these rules, applied to a fully GPU based detection pipeline. 

3.3. template generation and normalization 

The first step of the detection procedure is the template generation and 
nor malizat ion . 

3.3.1. template generation 

The our knowledge about coalescing binaries signal is only theoretical and 
the waveforms are only available with an approximation regime. The Post 
Newtonian [? ]GWapprox2,GWapprox3) approximation of order 2.5 in phase 
and in amplitude has been used in the current version of our prototype library, 
but there are no particular limitations on implementing other generators with 
higher accuracy level. The chosen approximation is enough for the purpose of 
this paper, because it permits to have a speed up factor between GPU and 
CPU that, at first order, can be considered roughly constant respect to the 
approximation order. 

Looking to figure ([!]), the processing steps implementing template gener- 
ation and normalization are: cuInspiraLhost Jnit, culnspiral_device_init, 
cuInspiral_device_template_proc(). 

The most computing demanding part is the template processing, where the 
two polarization of the template signal are generated, FFT transformed and 
normalized. 

The normalization is the procedure used to define a template scale fac- 
tors in such a way that < 6*+, 6+ >= 1/2 and < 0x , >= 1/2. This operation 
has been implemented in GPU with a complex product and reduction algo- 
rithms, which uses GPU shared memory to speed up the computation and the 
CPU as final accumulator. In detail we uses device shared memory to speed 
up the N ~ X^w^'S'^W operation. CUDA provide a fast shared memory region 
of 16KB for each multiprocessor that can be shared among threads running 
on the same multiprocessor. This can be used as a user-managed cache and 
moreover it is much faster than global memory of the GPU device, permitting 
higher bandwidth operations. To apply such solution to our problem we im- 
plement a reduction procedure where we divide the computation in sub-domain 
Ns = X]jv/s ^'^^'^M' "which result is stored into shared memory as first step, 
and than copied into host memory, performing the final N = Y\ Ng . For more 
information about this technique please refer to CUDA SDK[22[. For the FFT 
we use CUFFT, as previously described and profiled. Here we report perfor- 
mance details about the kernels in charge of generating the gw waveforms. In 
figure [4] we show the benchmark results, where the signal length is reported 
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in the horizontal axis and the generation time (fihed markers) and generation 
sample(empty markers) per second are reported in the vertical axis. The test 
measures the average time spent on signal generation for the two polarizations 
h\,h\ for both architectures, GPU and CPtH. Considering the range reported 



Template generation performance 

gensralor PNZ single precision on GPU (GTX 275) and CPLI(lntel E6550@2.33GHz) 
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Figure 4: In this plot we report the performances comparison between GPU and CPU about 
coalescing binary signal generation. With this benchmark we measure a gain of two order of 
magnitude in the region close to vector size=2^''. The generation used is a PN 2.5 in phase 
and zero in amplitude. In general we think to obtain the same gain factor also usign other 
more accurated generators. 

in figure (|4]), GPU implementation provides a gain factor of x[80-120], where 
performances increase for longer vectors. This behavior can be better observed 
looking on the normalized curves, representing the number of signal samples 
generated per second. In the CPU case performances are roughly constant 
while vector size increases, but in the GPU case the execution time per sample 
decreases. 

GPU implementation of GW signal generators permits to change perspective 
about gw signal generation and handling problem. In fact, actually the CPU 
based detection algorithms try to optimize this part essentially using bufferiza- 
tion strategies, avoiding the templates regeneration at each processing loop but 
introducing a lot of complexity in the code. 

Results, reported in this work about the signal generation, show a perfor- 
mance boost of two orders of magnitude, that can permit to change approach, 
trying a brute force procedure, namely generating on-the-fly the complete 
waveforms at each processing loop. In this way we have margin to simplify 
the pipeline code significantly, but keeping anyway the big performance gain. 



"^A CPU version of the code has been written. 
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3.3.2. Numerical accuracy 

Another test has been performed about the numerical accuracy of the gen- 
eration phase, comparing the single-precision of GPU generator respect to the 
CPU double precision counterpart. In detail, since the equation 12.11 is used in 
signal analysis to compare two signals, we can use the same equation to estimate 
the mismatch between them. That can be also seen in another way, knowing 
that the l2.1l is the filter that maximize the signal-to- noise ratio, it gives us infor- 
mation on how match energy we can extract and detect from the given signal. 
So, we use the matched filter definition (eq |2.ip . < 9cpu,&Gpu >, to write the 
mismatch function; 

^ _ I _ < ^Cpu , Sgpu > 
< Qcpui Qcpu > 

In case of two of perfectly identical signals normalized to 1 , it gives as result 
(no lose); while it gives a value > in all the other cases, where the output 
value is proportional to the amount signal energy lost. 

We evaluate the signal generation numerical error, applying the mismatch 
function between GPU and GPU gw signal for the same GB source, exploring a 
parameter space in the range [1 — 100] Ms. The measured mismatch error 
between the two signals is lower than 0.03%. In GB analysis usually a 
much higher incertitude of the order of few percent ([1 — 5]%) is introduced by 
stochastic process and template space quantization. Thus, we can affirm that 
this is a largely tolerable error. 

Moreover, in order to evaluate the error due to FFT algorithm we perform 
the same comparison in frequency domain. In this case, obtained results report 
an higher error level of about 0.1%, imputable to FFT numerical error, but also 
in any case it is completely tolerable. 

4. GPU based detection pipeline performances and profiling 

In this section we report the pipeline performances and profiling analysis, in 
order to evaluate the performance of our prototype library and to get details 
about each used algorithm. In some sense this section report the most important 
result of this work, because the processing time needed to complete the pipeline 
for each template processed is the main number of merit that can be use to make 
comparison (with eventually the appropriate correction and normalization) with 
other pipelines built with the same procedure. 

The test has been produced, building up a benchmark over the GPU GB 
pipeline that generates a template grid uniformly distributed and applies such 
grid of templates to a given input, using the pipeline discussed in this paper. 
Timing information about each computing step are stored. 

The configuration of the simulation was: 

• Intel E6550 2.33GHz 

• 2GB RAM DDR2 @800MHz 
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• GPU board NVIDIA GTX275 



• PCI-E v2 

• Ubuntu 9 OS 

We report also some technical details about graphical board and the envi- 
ronment: 

• 240 processing cores, distributed on 30 Multiprocessor with 8 cores each 
one. 

• CoreShadcrMcmory clock rate of 0.6/1.4/2 Ghz 

• Memory Bandwidth of 127 GB/s (1GB GDDR3) 

• 1010 Declared GFlops (MADD + MULL) 

• CUDA driver version 2.3 

• CUDA capability version 1.3 

Given a vector length of 2^", CB signal source [1.4, 1.4] Ms and sample rate of 
AkHz, the culnspiral pipeline is completely processed in: pipeline time/template 
= 28ms (average time) 

This is a very impressive result, if compared to the actual pipeline perfor- 
mances. Our prototype GPU CB pipeline gains more than a factor 50, pro- 
viding a good idea about the implication of these new computing architecture 
on the GW analysis and physics. 

In more detail, we report the CB pipeline profiling, shown in figure [SJ By 
that we can make some considerations. The first is that the processing phase 
template on-the-fly generation + FFT + normalization occupies roughly 
55% of the pipeline time. The second consideration is that the reverse FFT 
and the correlator computation (made adding in quadrature the two partial 
correlators), weigh similarly for the 15% of the time. 

Vector complex multiplication of the matched filtering formula and the cor- 
relator output above threshold detection are less critical, with an impact in the 
whole time of roughly 5%. 

We made a test also to explore the stability of the computation for long 
run. This had been done keeping the vector length fixed to 2^'^, and processing 
a template bank with more than 1000000 elements and recording the timimng 
informations. Results are shown in figure[71 where we report only the first 12000 
loops. Results report a very stable behavior with time. During the test GPU 
temperature increases from 51 to 84 Celsius degree during the first 5 minutes of 
run. After it was quite stable. 
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Figure 5; In figure we report the pipeline profiling, reporting the execution time for each 
important section. These results have been obtained with vector length of 2^" . 

Online analysis considerations 

In this subsection we use our results to report some evaluation and implica- 
tion about a GPU implementation of a CB pipeline in the online analysis case. 
On doing that we estimated the maximum number of templates that a single 
GPU board can process requiring that the computation has to be fast enough 
to follow the online data acquisition. This investigation is made changing the 
template length of binary system of [1.4; lA]Ms modifying the low frequency 
cut-ofl0. Results are reported in table 14.11 where we chosen as low frequency 
cut-off: [30, 23, 15, 14]i/z. Each frequency can be associated to different region 
of interest for different gravitational wave detector generations. 



Vector Size 


Cut in frequency 


max template N 




30 Hz 


3435 


2l9 


23 Hz 


4095 


2-20 


15 Hz 


4478 


221 


14 Hz 


4279 







In figure [7] we report similar information with profiling data and labeling the 
regions of interest. In particular we can observe how at 30Hz we can compute 
online 4000 templates, that is the roughly number of templates involved in the 
analysis of first detector generation. This means that with one single board 
we can cover the actual detector online analysis requirements. This is possible 
thanks to the factor 50 obtained with the GPU CB pipeline code. 

Lower frequency cut-of are interesting for advanced detector such as Adv Virgo 



Lowest frequency cut-off correspond to longest template 
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culnspiral benchmark 

processing time per template vs vector size 




2^° 2^' 



vector size 



Figure 6: In this plot the whole and profiled pipeline execution time is reported respect to the 
vector data size. A different vector length can be roughly associated to a "region of interest" for 
different gravitational wave detector. We highlight Virgo, LIGO (I gen), AdvVirgo, AdvLIGO 
(II Gen), Einstein Telescope (III Gen) 



and AdvLigo and for third generation such as Einstein Telescope. About the 
third generation, we have to recaU that Einstein Telescope by design could use a 
low frequency cutoSl — 2Hz. Under these condition templates could be 5-6 days 
long, requiring obviously to change the standard matched filtering procedure. 
For example we can figure out to use a multi band - hierarchical approach, where 
we divide the frequency range in more bands, making first the matched filtering 
on the higher frequency band. In this region the signal is shorter and has the 
highest amount of SNR. After that "detection" phase, the other bands can be 
recombined to acquire and increase signal for the the parameters reconstruction 
phase. 



5. Conclusions 

In this paper we report the first prototype pipeline for the coalescing binaries 
signal detection completely based on GPU code. Moreover the pipeline routines 
have been organized on library called culnspiral that will be soon free released. 

Our work show how using GPU hardware solution such as NVidia CUDA/GTX275 
we can gain at least a factor factor 50 in performance respect to the similar 
CPU implementation. Obviously there are still a lot of margin of code optimiza- 
tion and moreover new innovative hardwares are under release by the producers. 
For example we expect using the new CUDA library and Fermi solution a factor 
greater than 150. 
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culnspiral pipeline profiling benchmark 

vector size=2'^20 - templates starting frequency 24Hz - fs=4kHz 



Full pipeline 

template gen+FFT+norm 

cmplx correlator 
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— madd2 
maxfind 
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Figure 7: In this plot we report the benchmark made to evaluate the stability of the 
code/hardware. In this case we run the pipeline over a very long run and we record the 
ordinal number of the template loop. At this level we not record any strange behavior. We 
have just to report a temperature increase of 35 Celsius degree (up to 85 degree) 



Now we are working on production version of culnspiral where the code has 
been more optimized and provided of a more accurate GW signal generator and 
soon of a optimized chi-squaredQ test routine. Moreover a multi-GPU version 
of the code is under construction, permitting to exploit the computing power of 
systems mounting more than one GPU board. 

In the perspective of the manycore computing and of the gravitational wave 
detectors of third generation, INFN in 2009 approved the experiment " Manycore 
Computing for future Gravitational Observatory" (MaCGO)(25lpl with the aim 
of develop a numerical library using OpenCL on which work culnspiral has an 
important role. 

Our results demonstrate how the full GPU/manycore architecture program- 
ming can be very suitable for data analysis problem of second and, in particular, 
third generation of gravitational wave detectors, such as Einstein Telescope, 
providing a impressive computational power able to access physical problem 
previously inaccessible. The implementation of these new solutions is obviously 
not at zero cost, because many-core programming architecture needs a change 
of paradigm, but the effort is largely repaid with the achievable performances 
gain as it has been demonstrated in this work. Moreover we have to highlight 
how the many-core architecture is where the common CPU design arc evolving. 
Thus, in next 10 years the actual gravitational wave data analysis (and maybe 
many other fields of physics) had to face this change if want to get power from 
moder architecture and solve the ever new and ever more computing demanding 



the writer of this paper is the MaCGO national deputy 
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problems of Gravitational wave Physics. 
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